An Aspect Based Document Representation for Event Clustering

Authors

  • Wim De Smet
  • Marie-Francine Moens
Abstract

We have studied several techniques for creating and comparing content representations of textual documents in the field of event detection. We define a document as a collection of aspects, i.e. disjoint components that reveal (latent) topics and/or extracted information such as named entities. As underlying models we consider the vector space model and probabilistic topic models based on Latent Dirichlet Allocation. We also investigate the value of dependencies between the aspects, which are reflected by importance factors. We apply and evaluate our techniques on event detection in Wikinews, where we cluster news stories that discuss the same event. We found that the split representations yield the best event detection results compared to the ground-truth event clusters. Our methods for aspect detection, for learning the importance factors of the aspects, and for event clustering are completely unsupervised.
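As an illustration of comparing documents aspect by aspect, the following is a minimal sketch, not the authors' implementation. It assumes each aspect is a bag-of-words vector, that per-aspect similarity is cosine similarity, and that the importance factors are fixed weights combined by a normalized weighted sum. The aspect names (`topics`, `entities`), the weight values, and the function names are all hypothetical; the abstract only states that the importance factors are learned without supervision, not how.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def aspect_similarity(doc_a: dict, doc_b: dict, weights: dict) -> float:
    """Weighted average of per-aspect cosine similarities.

    `weights` plays the role of the importance factors; the values used
    below are made up for illustration, not learned as in the paper.
    """
    total = sum(weights.values())
    score = 0.0
    for aspect, weight in weights.items():
        vec_a = doc_a.get(aspect, Counter())
        vec_b = doc_b.get(aspect, Counter())
        score += weight * cosine(vec_a, vec_b)
    return score / total


# Hypothetical documents, each split into two disjoint aspects.
story_1 = {
    "topics": Counter(["election", "vote", "result"]),
    "entities": Counter(["Brussels", "Belgium"]),
}
story_2 = {
    "topics": Counter(["election", "vote", "coalition"]),
    "entities": Counter(["Brussels", "Belgium"]),
}
story_3 = {
    "topics": Counter(["football", "match"]),
    "entities": Counter(["Madrid"]),
}

weights = {"topics": 1.0, "entities": 2.0}  # assumed importance factors

sim_same_event = aspect_similarity(story_1, story_2, weights)
sim_other_event = aspect_similarity(story_1, story_3, weights)
```

Under these assumptions, `story_1` and `story_2` share all named entities and most topic terms, so `sim_same_event` is higher than `sim_other_event`. A clustering step could then group stories whose pairwise similarity exceeds a chosen threshold.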

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from large amounts of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm

Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other in some sense or another. It is a main task of exploratory data mining and a common technique for statistical data analysis. This paper proposes an improved version of the K-Means algorithm, namely Persistent K...

Language Model-Based Document Clustering Using Random Walks

We propose a new document vector representation specifically designed for the document clustering task. Instead of the traditional term-based vectors, a document is represented as an n-dimensional vector, where n is the number of documents in the cluster. The value at each dimension of the vector is closely related to the generation probability based on the language model of the corresponding docum...

Publication date: 2009